Question: Is there an earnings gap between male and female college graduates 6 years after graduation, and does the gap widen or shrink 4 years later?
college_reduced is created to store only the columns which we will use in this final project.
college_reduced <- college %>%
select(
MN_EARN_WNE_MALE0_P10,MN_EARN_WNE_MALE1_P10,MN_EARN_WNE_MALE0_P6,
MN_EARN_WNE_MALE1_P6,COSTT4_A,ENRL_ORIG_YR8_RT,REGION,SATVR25,
SATVR75,SATMT25,SATMT75,SATWR25,SATWR75,SATVRMID,SATMTMID,SATWRMID,
ACTCM25,ACTCM75,ACTEN25,ACTEN75,ACTMT25,ACTMT75,ACTWR25,ACTWR75,
ACTCMMID,ACTENMID,ACTMTMID,ACTWRMID,INSTNM,TUITIONFEE_IN,TUITIONFEE_OUT
)
college_renamed <- college_reduced %>%
rename(
mean_earning_female_10yrs = MN_EARN_WNE_MALE0_P10,
mean_earning_male_10yrs = MN_EARN_WNE_MALE1_P10,
mean_earning_female_6yrs = MN_EARN_WNE_MALE0_P6,
mean_earning_male_6yrs = MN_EARN_WNE_MALE1_P6,
Institution = INSTNM,
In_state_tuition = TUITIONFEE_IN,
Out_state_tuition = TUITIONFEE_OUT,
tuition_yearly = COSTT4_A,
Enrolled_8_years = ENRL_ORIG_YR8_RT
)
mean_earning <- college_renamed %>%
select(mean_earning_female_10yrs, mean_earning_male_10yrs,
mean_earning_female_6yrs, mean_earning_male_6yrs)
A gender column is created to label the observations with either “female” or “male”.
# create the female columns
female <- mean_earning %>%
select(mean_earning_female_10yrs, mean_earning_female_6yrs) %>%
mutate(gender = "female")
# create the male columns
male <- mean_earning %>%
select(mean_earning_male_10yrs, mean_earning_male_6yrs) %>%
mutate(gender = "male")
The mean_earning_female_10yrs and mean_earning_male_10yrs columns in female and male are renamed to mean_earning_10yrs. The same is done for the mean-earning after 6 years, whose columns are renamed to mean_earning_6yrs.
# rename the female dataset
female_renamed <- female %>%
rename(mean_earning_10yrs = mean_earning_female_10yrs,
mean_earning_6yrs = mean_earning_female_6yrs)
# rename the male dataset
male_renamed <- male %>%
rename(mean_earning_10yrs = mean_earning_male_10yrs,
mean_earning_6yrs = mean_earning_male_6yrs)
female_renamed and male_renamed are then row-combined into mean_earning_clean for the data analysis in the next section.
mean_earning_clean <- rbind(female_renamed, male_renamed)
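The reshaping steps above can be checked on a small example. The following is a minimal base R sketch, assuming a made-up data frame `toy` with three hypothetical institutions; the numbers are invented for illustration and are not taken from the college data.

```r
# Toy data with invented earnings for three hypothetical institutions
toy <- data.frame(
  mean_earning_female_10yrs = c(35000, 40000, NA),
  mean_earning_male_10yrs   = c(45000, 50000, 42000),
  mean_earning_female_6yrs  = c(28000, 31000, 27000),
  mean_earning_male_6yrs    = c(33000, 37000, 30000)
)

# One block per gender, with shared column names and a gender label
toy_female <- data.frame(
  mean_earning_10yrs = toy$mean_earning_female_10yrs,
  mean_earning_6yrs  = toy$mean_earning_female_6yrs,
  gender             = "female"
)
toy_male <- data.frame(
  mean_earning_10yrs = toy$mean_earning_male_10yrs,
  mean_earning_6yrs  = toy$mean_earning_male_6yrs,
  gender             = "male"
)

# Stack the two blocks: each institution now contributes two rows
toy_clean <- rbind(toy_female, toy_male)
toy_clean
```

Because every institution appears once per gender, the stacked data has twice as many rows as the original, and missing values are carried along unchanged.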
The first step is to visualize both the mean-earning after 6 years and after 10 years as histograms overlaid with kernel density estimates. The scale of the x-axis is adjusted so that the graph focuses on the majority of the data.
ggplot(mean_earning_clean,
aes(x = mean_earning_6yrs,
# after_stat(density) replaces the deprecated ..density.. notation
y = after_stat(density),
fill = gender)) +
geom_histogram(alpha = 0.5) +
geom_density(alpha = 0.7) +
# adjust the x-axis
coord_cartesian(xlim = c(10000, 75000))
* The mean-earning after 6 years for female graduates is right-skewed, centers at about $28,000, and ranges from about $10,000 to $80,000.
* The mean-earning after 6 years for male graduates is right-skewed, centers at about $30,000, and ranges from about $15,000 to $80,000.
ggplot(mean_earning_clean,
aes(x = mean_earning_10yrs,
# after_stat(density) replaces the deprecated ..density.. notation
y = after_stat(density),
fill = gender)) +
geom_histogram(alpha = 0.5) +
geom_density(alpha = 0.7) +
# adjust the x-axis
coord_cartesian(xlim = c(10000, 100000))
* The mean-earning after 10 years for female graduates is right-skewed, centers at about $30,000, and ranges from about $12,500 to $100,000.
* The mean-earning after 10 years for male graduates is right-skewed as well, centers at about $40,000, and ranges from about $18,000 to $100,000.
Based on the visualization, there is an income gap between female and male college graduates both 6 years and 10 years after graduation. The median income for male graduates is higher than that of the female graduates, and the gap increases after 10 years.
There may be outliers in the data which could affect the reliability of statistical analysis. Thus, boxplots are used here to identify the outliers.
Boxplots:
* Boxplots of the mean-earning of college graduates 6 years and 10 years after graduation are created.
ggplot(mean_earning_clean) +
geom_boxplot(aes(x = gender, y = mean_earning_6yrs))
ggplot(mean_earning_clean) +
geom_boxplot(aes(x = gender, y = mean_earning_10yrs))
* Based on the above graphics, there are numerous extremely high income values in both the mean-earning after 6 years and after 10 years, which distort the mean and standard deviation as summary statistics. However, they are unlikely to be data entry errors, so they remain in the data sets in the following analysis.
* The median will be used to compare the mean-earning because extreme values have little effect on the median.
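To illustrate why the median is preferred here, the following sketch applies the 1.5 × IQR rule (the default whisker rule in geom_boxplot) to a made-up vector of earnings with one extreme value. The numbers are invented for illustration only.

```r
# Invented earnings with one extreme value
earnings <- c(25000, 27000, 28000, 30000, 32000, 34000, 150000)

# Fences of the 1.5 x IQR rule
q1 <- quantile(earnings, 0.25, names = FALSE)
q3 <- quantile(earnings, 0.75, names = FALSE)
lower_fence <- q1 - 1.5 * IQR(earnings)
upper_fence <- q3 + 1.5 * IQR(earnings)

# Values outside the fences are drawn as individual points in a boxplot
outliers <- earnings[earnings < lower_fence | earnings > upper_fence]

# Compare how much the mean and the median move when the outlier is dropped
trimmed <- earnings[!(earnings %in% outliers)]
mean_shift   <- mean(earnings) - mean(trimmed)
median_shift <- median(earnings) - median(trimmed)

outliers
c(mean_shift = mean_shift, median_shift = median_shift)
```

In this toy example the single extreme value moves the mean by more than $17,000 but the median by only $1,000, which is the behavior the bullet points above rely on.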
The summary statistics are calculated to further compare the income between the female and male graduates.
mean_earning_6yrs_summary_stats <- mean_earning_clean %>%
# filter the NA values in the data
filter(!is.na(mean_earning_6yrs)) %>%
# compute the summary statistics of mean_earning_6yrs
group_by(gender) %>%
summarize(
mean = mean(mean_earning_6yrs),
median = median(mean_earning_6yrs),
sd = sd(mean_earning_6yrs),
iqr = IQR(mean_earning_6yrs),
min = min(mean_earning_6yrs),
max = max(mean_earning_6yrs)
)
mean_earning_6yrs_summary_stats
| gender | mean | median | sd | iqr | min | max |
|---|---|---|---|---|---|---|
| female | 30201.77 | 28300 | 10158.18 | 11200 | 10700 | 141600 |
| male | 36577.77 | 34500 | 11994.03 | 12100 | 14800 | 166900 |
mean_earning_10yrs_summary_stats <- mean_earning_clean %>%
filter(!is.na(mean_earning_10yrs)) %>%
group_by(gender) %>%
summarize(
mean = mean(mean_earning_10yrs),
median = median(mean_earning_10yrs),
sd = sd(mean_earning_10yrs),
iqr = IQR(mean_earning_10yrs),
min = min(mean_earning_10yrs),
max = max(mean_earning_10yrs)
)
mean_earning_10yrs_summary_stats
| gender | mean | median | sd | iqr | min | max |
|---|---|---|---|---|---|---|
| female | 37290.80 | 34200 | 15020.84 | 15100 | 13800 | 232900 |
| male | 48755.33 | 45150 | 19967.72 | 17800 | 17700 | 250000 |
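The size of the gap can be read directly from the two tables. The short calculation below copies the medians from the tables above and computes the absolute gap and the female-to-male ratio at each point in time; the variable names are introduced here for illustration.

```r
# Medians copied from the two summary tables above
median_6yrs  <- c(female = 28300, male = 34500)
median_10yrs <- c(female = 34200, male = 45150)

# Absolute gap in dollars (male minus female)
gap_6yrs  <- unname(median_6yrs["male"]  - median_6yrs["female"])   # 6200
gap_10yrs <- unname(median_10yrs["male"] - median_10yrs["female"])  # 10950

# Relative gap: female median as a share of the male median
ratio_6yrs  <- unname(median_6yrs["female"]  / median_6yrs["male"])   # about 0.82
ratio_10yrs <- unname(median_10yrs["female"] / median_10yrs["male"])  # about 0.76

c(gap_6yrs = gap_6yrs, gap_10yrs = gap_10yrs)
round(c(ratio_6yrs = ratio_6yrs, ratio_10yrs = ratio_10yrs), 2)
```

In terms of medians, the gap grows from $6,200 to $10,950, and the female median falls from roughly 82% to roughly 76% of the male median.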
The previous visualizations and summary statistics all suggest that there is indeed an income gap between the male and female graduates 6 years and 10 years after graduation. With that in mind, hypothesis tests are carried out to study the question further.
Hypothesis testing for the mean-earning of college graduates after 6 years of graduation:
Null hypothesis: The median for the mean-earnings after 6 years of graduation does not differ for the male and female graduates.
Alternative hypothesis: The median for the mean-earnings after 6 years of graduation does differ for the male and female graduates.
# pull the median column
mean_earning_6yrs_medians <- mean_earning_6yrs_summary_stats %>%
pull(median)
# calculate the observed diff_in_median:
# (median of male mean-earning) - (median of female mean-earning)
diff_median_6yrs_right <- mean_earning_6yrs_medians[2] - mean_earning_6yrs_medians[1]
diff_median_6yrs_left <- -(mean_earning_6yrs_medians[2] - mean_earning_6yrs_medians[1])
mean_earning_6yrs_medians_null <- mean_earning_clean %>%
specify(formula = mean_earning_6yrs ~ gender) %>%
hypothesize(null = "independence") %>%
generate(reps = 10000, type = "permute") %>%
calculate(stat = "diff in medians", order = c("male", "female"))
# p-value on the right side
pvalue_6yrs_right <- mean_earning_6yrs_medians_null %>%
get_p_value(obs_stat = diff_median_6yrs_right, direction = "right")
# p-value on the left side
pvalue_6yrs_left <- mean_earning_6yrs_medians_null %>%
get_p_value(obs_stat = diff_median_6yrs_left, direction = "left")
# two-sided p-value
pvalue_6yrs <- pvalue_6yrs_right + pvalue_6yrs_left
pvalue_6yrs
| p_value |
|---|
| 0 |
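A p-value of 0 from a permutation test means that none of the 10,000 shuffles produced a difference as extreme as the observed one, so the true p-value is best read as smaller than 1/10,000 rather than exactly zero. The idea behind the test can be sketched in base R as follows. This is a simplified illustration on simulated data, not the infer implementation; the sample sizes, the seed, and the helper `diff_in_medians` are assumptions made for the example.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated (not real) earnings: the male group is shifted upward
earn   <- c(rlnorm(200, log(28000), 0.3), rlnorm(200, log(34000), 0.3))
gender <- rep(c("female", "male"), each = 200)

# Test statistic: (median of male) - (median of female)
diff_in_medians <- function(y, g) {
  median(y[g == "male"]) - median(y[g == "female"])
}
obs <- diff_in_medians(earn, gender)

# Null distribution: shuffle the gender labels and recompute the statistic
reps <- 2000
null_stats <- replicate(reps, diff_in_medians(earn, sample(gender)))

# Two-sided p-value: share of shuffles at least as extreme as the observed
# difference in either direction (right tail plus mirrored left tail)
p_two_sided <- mean(abs(null_stats) >= abs(obs))

# If no shuffle is as extreme, report an upper bound instead of 0
p_reported <- max(p_two_sided, 1 / reps)
p_reported
```

Shuffling the labels breaks any link between gender and earnings, so the spread of `null_stats` shows how large a difference in medians can arise by chance alone.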
Hypothesis testing for the mean-earning for college graduates after 10 years of graduation:
Null hypothesis: There is no difference in the median mean-earning between male and female college graduates after 10 years of graduation.
Alternative hypothesis: There is a difference in the median mean-earning between male and female college graduates after 10 years of graduation.
# pull the medians for mean-earning after 10 years.
mean_earning_10yrs_medians <- mean_earning_10yrs_summary_stats %>%
pull(median)
# calculate the observed diff_in_median:
# (median of male mean-earning) - (median of female mean-earning)
diff_median_10yrs_right <- mean_earning_10yrs_medians[2] - mean_earning_10yrs_medians[1]
diff_median_10yrs_left <- -(mean_earning_10yrs_medians[2] - mean_earning_10yrs_medians[1])
mean_earning_10yrs_medians_null <- mean_earning_clean %>%
specify(formula = mean_earning_10yrs ~ gender) %>%
hypothesize(null = "independence") %>%
generate(reps = 10000, type = "permute") %>%
calculate(stat = "diff in medians", order = c("male", "female"))
# p-value on the right side
pvalue_10yrs_right <- mean_earning_10yrs_medians_null %>%
get_p_value(obs_stat = diff_median_10yrs_right, direction = "right")
# p-value on the left side
pvalue_10yrs_left <- mean_earning_10yrs_medians_null %>%
get_p_value(obs_stat = diff_median_10yrs_left, direction = "left")
# two-sided p-value
pvalue_10yrs <- pvalue_10yrs_right + pvalue_10yrs_left
pvalue_10yrs
| p_value |
|---|
| 0 |
Both hypothesis tests suggest that the median mean-earning of male and female college graduates differs, both 6 years and 10 years after graduation. Together with the observed medians, this indicates an income gap in which male graduates have a higher median mean-earning than female graduates at both points in time.
Besides the hypothesis tests on the difference in medians of the mean-earning for the two groups, the effect size between the means of the mean-earning of the two groups is also measured. Measuring the effect size is an important step in determining whether the income gap shrinks or widens between 6 and 10 years after graduation.
The effsize library needs to be loaded to carry out this test.
library(effsize)
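For reference, Cohen's d is the difference between two group means divided by their pooled standard deviation. The function below is a minimal base R sketch of that textbook definition; `cohens_d` is a name introduced here, and the effsize package may offer options (such as corrections) that this sketch does not cover.

```r
# Cohen's d: standardized difference between two group means
cohens_d <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  nx <- length(x)
  ny <- length(y)
  # Pooled standard deviation, weighting each variance by its degrees of freedom
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Toy check with invented values: means 4 and 3, pooled sd 2
d_toy <- cohens_d(c(2, 4, 6), c(1, 3, 5))
d_toy
```

Because d is expressed in standard deviation units, values computed at 6 and 10 years can be compared directly even though the earnings themselves are on different scales at the two points in time.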
# q-q plot for mean_earning_6yrs
ggplot(mean_earning_clean) +
geom_qq(aes(sample = mean_earning_6yrs, color = gender)) +
geom_qq_line(aes(sample = mean_earning_6yrs, color = gender)) +
labs(title = "q-q plot of mean earning after 6 years")
# q-q plot for mean_earning_10yrs
ggplot(mean_earning_clean) +
geom_qq(aes(sample = mean_earning_10yrs, color = gender)) +
geom_qq_line(aes(sample = mean_earning_10yrs, color = gender)) +
labs(title = "q-q plot of mean earning after 10 years")
The earnings are transformed with log10() to satisfy the condition of a normal distribution for a valid Cohen’s d test.
ggplot(mean_earning_clean) +
geom_qq(aes(sample = log10(mean_earning_6yrs), color = gender)) +
geom_qq_line(aes(sample = log10(mean_earning_6yrs), color = gender)) +
labs(title = "q-q plot of log10(mean_earning_6yrs)")
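The effect of the log10() transformation on a q-q plot can be illustrated with simulated data. The sketch below assumes log-normally distributed "earnings", which is a convenient stand-in for right-skewed data and not a claim about the college data; `qq_correlation` is a helper introduced here that measures how close the q-q points are to a straight line.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated right-skewed earnings (log-normal by construction, not real data)
sim_earnings <- rlnorm(1000, meanlog = log(30000), sdlog = 0.5)

# Correlation between the sorted sample and theoretical normal quantiles:
# values closer to 1 mean the q-q points lie closer to a straight line
qq_correlation <- function(x) {
  x <- sort(x[!is.na(x)])
  cor(x, qnorm(ppoints(length(x))))
}

r_raw <- qq_correlation(sim_earnings)
r_log <- qq_correlation(log10(sim_earnings))

c(raw = r_raw, log10 = r_log)
```

For the simulated data the log10 values track the normal quantiles more closely than the raw values do. For the real earnings, the q-q plots remain the appropriate check of whether the transformation is sufficient.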